Basic Statistics Reference

Author

Guy Schnidrig

This is a reference book for the Basic Statistics and Projects in R course of the Public Health Sciences Course Program at the University of Bern.

1 Descriptive statistics

1.1 Data types

In R, variables can have different types. For instance, it is crucial to differentiate between numbers and character strings, as well as between tables and plain lists of numbers. The class function helps us identify the specific type of object we are dealing with.

Code
x <- 42
class(x)
[1] "numeric"
Code
y <- "apple"
class(y)
[1] "character"

1.1.1 Data frames

In R, the prevalent approach for storing tabular data is the data frame. Conceptually, a data frame can be envisioned as a table in which rows represent observations and columns define the variables reported for each observation. Data frames are highly advantageous for data analysis because they can combine different data types in a single object. To load the palmerpenguins data set into a data frame, we can use the data function in R. The data set contains measurements of penguins from three different species.

Code
data(penguins)

We can now check the class of the object penguins. The tbl_df class is a subclass of data.frame, created in order to have different default behavior. The colloquial term “tibble” refers to a data frame that has the tbl_df class. Tibble is the central data structure for the set of packages known as the tidyverse.

Code
class(penguins)
[1] "tbl_df"     "tbl"        "data.frame"

Now we can inspect the first 10 rows with head.

Code
head(penguins, 10)
# A tibble: 10 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 2 more variables: sex <fct>, year <int>

The str function in R provides a concise and informative display of the internal structure of an R object.

Code
str(penguins)
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

1.2 Mean

To calculate the mean of a variable in R, you can employ mean

Code
x = 1:10
mean(x)
[1] 5.5

We can calculate the mean of the variable “Bill length” in the penguin data set by combining mean and summarise. Because there are some NA values, we specify the na.rm argument of mean.

Code
penguins %>%
  summarise(Mean_Bill_Length = mean(bill_length_mm, na.rm = TRUE)) 
# A tibble: 1 × 1
  Mean_Bill_Length
             <dbl>
1             43.9

There are three different species in the penguins data set. To calculate the mean for each species we use group_by. Additionally we can use kable to format the output table.

Code
penguins %>%
  group_by(species) %>%                        
  summarise(Mean_Bill_Length = mean(bill_length_mm, na.rm = TRUE)) %>%
  kable(digits = 2)
species Mean_Bill_Length
Adelie 38.79
Chinstrap 48.83
Gentoo 47.50

1.3 Median

To calculate the median we can replicate the code from above and insert the function median.

Code
penguins %>%
  group_by(species) %>%                        
  summarise(Mean_Bill_Length = mean(bill_length_mm, na.rm = TRUE),
            Median_Bill_Length = median(bill_length_mm, na.rm = TRUE)) %>%
  kable(digits = 2)
species Mean_Bill_Length Median_Bill_Length
Adelie 38.79 38.80
Chinstrap 48.83 49.55
Gentoo 47.50 47.30

1.4 Variance

\(\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2\)

var computes the sample variance of a numeric vector. Note that it divides by n − 1, whereas the formula above is the population variance with denominator N.

Code
penguins %>%
  group_by(species) %>%                        
  summarise(Mean_Bill_Length = mean(bill_length_mm, na.rm = TRUE),
            Median_Bill_Length = median(bill_length_mm, na.rm = TRUE),
            Variance_Bill_Length = var(bill_length_mm, na.rm = TRUE)) %>%
  kable(digits = 2)
species Mean_Bill_Length Median_Bill_Length Variance_Bill_Length
Adelie 38.79 38.80 7.09
Chinstrap 48.83 49.55 11.15
Gentoo 47.50 47.30 9.50

1.5 Standard deviation

\(\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}\)

Code
penguins %>%
  group_by(species) %>%                        
  summarise(Mean_Bill_Length = mean(bill_length_mm, na.rm = TRUE),
            Median_Bill_Length = median(bill_length_mm, na.rm = TRUE),
            Variance_Bill_Length = var(bill_length_mm, na.rm = TRUE),
            SD_Bill_Length = sd(bill_length_mm, na.rm = TRUE)) %>%
  kable(digits = 2)
species Mean_Bill_Length Median_Bill_Length Variance_Bill_Length SD_Bill_Length
Adelie 38.79 38.80 7.09 2.66
Chinstrap 48.83 49.55 11.15 3.34
Gentoo 47.50 47.30 9.50 3.08

1.6 The Interquartile Range

\(IQR = Q_3 - Q_1\)

Code
summary_stat <- penguins %>%
  group_by(species) %>%                        
  summarise(Mean_Bill_Length = mean(bill_length_mm, na.rm = TRUE),
            Median_Bill_Length = median(bill_length_mm, na.rm = TRUE),
            Variance_Bill_Length = var(bill_length_mm, na.rm = TRUE),
            SD_Bill_Length = sd(bill_length_mm, na.rm = TRUE),
            IQR_Bill_Length = IQR(bill_length_mm, na.rm = TRUE))

summary_stat %>%
  kable(digits = 2)
species Mean_Bill_Length Median_Bill_Length Variance_Bill_Length SD_Bill_Length IQR_Bill_Length
Adelie 38.79 38.80 7.09 2.66 4.00
Chinstrap 48.83 49.55 11.15 3.34 4.73
Gentoo 47.50 47.30 9.50 3.08 4.25

1.7 Correlation

\(\rho_{xy} = \frac{{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}}{{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}}\)

The cor function computes the correlation coefficient, which measures the strength and direction of the linear relationship between two numeric variables. If your data contains missing values, you can handle them appropriately when calculating the correlation in R. The cor function has an argument called use, which allows you to specify how to handle missing values. For this example we only include rows with complete observations.

Code
cor_matrix <- penguins %>% 
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  cor(use = "complete.obs")

cor_matrix %>%
  kable(digits = 2)
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm              1.00         -0.24              0.66        0.60
bill_depth_mm              -0.24          1.00             -0.58       -0.47
flipper_length_mm           0.66         -0.58              1.00        0.87
body_mass_g                 0.60         -0.47              0.87        1.00

2 Data visualization

In order to use ggplot2, it is necessary to become familiar with multiple functions and arguments. Remembering all of them can be challenging, hence we strongly advise keeping the ggplot2 cheat sheet accessible. You can obtain a copy from this link.

With ggplot we have many ways to customize the look of our plot. Try to experiment with: fill, color, xlab, ylab, ggtitle, labs, xlim, ylim, scale_fill_manual, scale_color_manual, theme, and theme_bw.
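As a minimal sketch of a few of these options (the axis labels, title, and colors below are arbitrary choices for illustration), we can restyle a box plot of body mass by species:

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

penguins %>%
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  xlab("Species") +                  # rename the x axis
  ylab("Body mass (g)") +            # rename the y axis
  ggtitle("Body mass by species") +  # add a plot title
  scale_fill_manual(values = c("steelblue", "darkorange", "darkgreen")) +
  theme_bw()                         # black-and-white theme
```

Each layer added with + overrides the corresponding default, so these options can be combined freely.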

2.1 Histogram

Histograms are an easy way to see how your data is distributed. They provide a visual representation of the frequency or count of data points falling within specific intervals, known as bins.

Code
penguins %>%
  ggplot(aes(bill_length_mm)) +
  geom_histogram(color = "black",
                 fill = "steelblue", 
                 bins = 30) + 
  theme_bw()

Alternatively you can use geom_density to get a quick overview over the distribution.

Code
penguins %>%
  ggplot(aes(bill_length_mm)) +
  geom_density(color = "black",
               fill = "steelblue") + 
  theme_bw()

2.2 Box plot

Box plots, also known as box-and-whisker plots, are useful statistical tools for visualizing and summarizing the distribution of a data set. They provide a concise summary of several important characteristics of the data, including the center, spread, skewness, and potential outliers.

Code
penguins %>%
  ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
  geom_boxplot() + 
  theme_bw()

2.3 Scatter plot

A scatter plot is a great way to visualize relationships and differences between variables and groups within your data sets. In our data set, bill length appears to have a linear relationship with body mass.

Code
penguins %>%
  ggplot(aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point() + 
  theme_bw()

2.4 Correlation plot

An easy way to plot a correlation matrix is to use ggcorrplot.

Code
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower", lab = TRUE)

3 Statistical calculations

3.1 z-transformation

\(z = \frac{{x - \mu}}{{\sigma}}\)

\(x\) represents the individual data point.

\(\mu\) is the population mean.

\(\sigma\) is the population standard deviation.

To apply a z-transformation in R we can use scale. This will set the mean to 0 and the standard deviation to 1. We can either replace our original value or create a new variable in our data set. With head we just look at the top of the tibble.

Code
penguins %>% 
  group_by(species) %>% 
  mutate(z_score_1 = scale(flipper_length_mm),
         z_score_2 = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE))/sd(flipper_length_mm, na.rm = TRUE)) %>%
  dplyr::select(species, flipper_length_mm, z_score_1, z_score_2) %>%
  head() 
# A tibble: 6 × 4
# Groups:   species [1]
  species flipper_length_mm z_score_1[,1] z_score_2
  <fct>               <int>         <dbl>     <dbl>
1 Adelie                181      -1.37     -1.37   
2 Adelie                186      -0.605    -0.605  
3 Adelie                195       0.772     0.772  
4 Adelie                 NA      NA        NA      
5 Adelie                193       0.466     0.466  
6 Adelie                190       0.00709   0.00709

3.2 Calculating probabilities using pnorm

\(P(X \leq x) = \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(t-\mu)^2/2\sigma^2} dt\)

By utilizing pnorm with lower.tail = FALSE, we can determine the probability that a value from a normal distribution with mean 1 and standard deviation 2 exceeds 0.2.

Code
pnorm(0.2, mean = 1, sd = 2, lower.tail = FALSE)
[1] 0.6554217

To calculate the probability that a bill length is equal to or less than 42, we plug the sample mean and standard deviation into pnorm.

Code
pnorm(42, mean(penguins$bill_length_mm, na.rm = TRUE), sd(penguins$bill_length_mm, na.rm = TRUE), lower.tail = TRUE)
[1] 0.3624

We can do this for each species with group_by. Here we calculate the probability that a bill length is equal to or less than 42.

Code
penguins %>%
  group_by(species) %>%
  mutate(prob = pnorm(42,
                      mean = mean(bill_length_mm, na.rm = TRUE), 
                      sd = sd(bill_length_mm, na.rm = TRUE),
                      lower.tail = TRUE)) %>%
  distinct(species, prob) %>%
  kable(digits = 2)
species prob
Adelie 0.89
Gentoo 0.04
Chinstrap 0.02

3.3 Manual confidence interval

\(CI = \bar{x} \pm z \frac{s}{\sqrt{n}}\)

\(\bar{x}\) represents the sample mean
\(s\) represents the sample standard deviation
\(n\) represents the sample size
\(z\) represents the critical value from the standard normal distribution corresponding to the desired confidence level (in practice, when the population standard deviation is unknown, we use the critical value from the t-distribution instead, as below)

A confidence interval is a range of values that is constructed from sample data and is used to estimate an unknown population parameter. It provides a measure of the uncertainty associated with the estimated parameter.

To calculate the confidence interval by hand, we start by defining the confidence level and calculating n.
By wrapping the pipe in parentheses and appending $bill_length_mm, we can access the vector from which drop_na has removed all NA values.

Code
confidence_level <- 0.95

n <- length((penguins %>%
              dplyr::select(bill_length_mm) %>%
              drop_na())$bill_length_mm)

Calculate sample mean and standard deviation.

Code
bill_length_mean <- mean(penguins$bill_length_mm, na.rm = TRUE)
bill_length_sd <- sd(penguins$bill_length_mm, na.rm = TRUE)

Alternatively, we can access our summary data frame if we are interested in only one species.
With [1] we access the first element of the means, which corresponds to the Adélie species.

Code
bill_length_mean_adelie <- summary_stat$Mean_Bill_Length[1]
bill_length_sd_adelie <- summary_stat$SD_Bill_Length[1]

The next step is to compute the standard error.

Code
standard_error <- bill_length_sd / sqrt(n)

We then compute the t-score related to the confidence level.
First we set alpha as 1 - confidence level.
Then we define our degrees of freedom as n - 1.

Code
alpha = 1 - confidence_level

degrees_of_freedom = n - 1

t_score = qt(p = alpha / 2, df = degrees_of_freedom, lower.tail= FALSE)

Calculate the margin of error.

Code
margin_error <- t_score * standard_error

Now we can calculate the lower and upper bounds of the confidence interval.

Code
lower_bound <- bill_length_mean - margin_error
upper_bound <- bill_length_mean + margin_error

To print text and stored variables we can use cat.

Code
cat("The", confidence_level * 100, "% confidence interval is [", lower_bound, ",", upper_bound, "].")
The 95 % confidence interval is [ 43.34125 , 44.50261 ].
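As a cross-check of the manual calculation, the same t-based interval can be obtained directly with R's built-in t.test; here we drop the NA values explicitly with na.omit:

```r
library(palmerpenguins)

# 95% confidence interval for the mean bill length
t.test(na.omit(penguins$bill_length_mm), conf.level = 0.95)$conf.int
```

Because t.test constructs the same interval, mean ± t × s/√n, this reproduces the bounds computed step by step above.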

3.4 One-Sample T-Test

If you have a single sample and want to test if its mean is significantly different from a hypothesized value, you can use t_test. The function calculates a t-statistic and provides a p-value.

Filter the data to only include Adélie penguins.

Code
adelie <- penguins %>%
  filter(species == "Adelie")

Perform a one-sample t-test on the body mass of Adélie penguins

Code
adelie %>%
  t_test(body_mass_g ~ 1) %>%
  kable()
.y. group1 group2 n statistic df p
body_mass_g 1 null model 151 99.16672 150 0

3.5 Two-Sample T-Test

If you have two independent samples and want to compare their means, you can use the t_test with two sample inputs. The function calculates a t-statistic and provides a p-value.

Filter the data to only include Adélie and Chinstrap penguins.

Code
adelie_chinstrap <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"))

Perform a two-sample t-test on the body mass of Adélie and Chinstrap penguins.

Code
adelie_chinstrap %>%
  t_test(body_mass_g ~ species) %>%
  kable()
.y. group1 group2 n1 n2 statistic df p
body_mass_g Adelie Chinstrap 151 68 -0.5430902 152.4548 0.588

To inspect our result we can use ggplot. To display the test significance on the plot we can use stat_compare_means from the ggpubr package.

Code
adelie_chinstrap %>%
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  theme_bw() +
  stat_compare_means(aes(label = after_stat(p.signif)), method = "t.test", ref.group = "Chinstrap")

3.6 Chi-Square Test

If you have categorical data and want to test the independence of two variables, you can use the chisq_test. The function calculates a chi-square statistic and provides a p-value.

Perform a chi-square test on the species and island variables.

Code
species_island <- table(penguins$species, penguins$island)

species_island %>%
  chisq_test() %>%
  kable()
n statistic p df method p.signif
344 299.5503 0 4 Chi-square test ****

4 Non-Normality

4.1 Normality Tests

To see what normal data is supposed to look like, we start by randomly generating numbers with rnorm. We can specify the mean and the standard deviation of the generated data.

Code
set.seed(42)

normal_data <- as.data.frame(rnorm(1000)) %>%
  rename(random_numbers = `rnorm(1000)`) 

normal_data %>%
  head(10) %>%
  kable(align = "l")
random_numbers
1.3709584
-0.5646982
0.3631284
0.6328626
0.4042683
-0.1061245
1.5115220
-0.0946590
2.0184237
-0.0627141

Now we create data that follows an exponential distribution with rexp.

Code
exp_data <- as.data.frame(rexp(1000, rate=3)) %>%
  rename(random_numbers = `rexp(1000, rate = 3)`)

exp_data %>%
  head(10) %>%
  kable(align = "l")
random_numbers
0.1013135
0.2227730
0.1562810
0.0979355
0.2285855
0.5544748
0.3826574
0.2645057
0.6935407
0.6574124

Create histograms for both data sets.

Code
normal_data %>%
  ggplot(aes(random_numbers)) +
  geom_histogram(color = "black",
                 fill = "steelblue",
                 bins = 25) +
  theme_bw()

Code
exp_data %>%
  ggplot(aes(random_numbers)) +
  geom_histogram(color = "black",
                 fill = "steelblue",
                 bins = 25) +
  theme_bw()

To visually assess normality we can create a Q-Q plot for each data set.

Code
normal_data %>%
  ggplot(aes(sample = random_numbers)) + 
           geom_qq_line(distribution = stats::qnorm) +
           geom_qq(color = "steelblue", distribution = stats::qnorm) + 
           xlab("Theoretical Quantiles") +
           ylab("Sample Quantiles") +
           theme_bw()

Code
exp_data %>%
  ggplot(aes(sample = random_numbers)) + 
           geom_qq_line(distribution = stats::qnorm) +
           geom_qq(color = "steelblue", distribution = stats::qnorm) + 
           xlab("Theoretical Quantiles") +
           ylab("Sample Quantiles") +
           theme_bw()

Perform a Shapiro-Wilk test using shapiro_test. We cannot reject the null hypothesis of normality because the p-value is greater than 0.05.

Code
normal_data %>%
  shapiro_test(random_numbers) %>%
  kable()
variable statistic p
random_numbers 0.9988187 0.7669591

Perform a Shapiro-Wilk test for the exponential data using shapiro_test. We can reject the null hypothesis because the p-value is lower than 0.05.

Code
exp_data %>%
  shapiro_test(random_numbers) %>%
  kable()
variable statistic p
random_numbers 0.8243563 0

For sample sizes larger than 50 we should use the Kolmogorov-Smirnov test via ks.test. For the normal data the p-value is greater than 0.05, so we cannot reject the null hypothesis. For the exponential data the p-value is less than 0.05, so we reject the null hypothesis: that sample does not come from a normal distribution.

Code
ks.test(normal_data$random_numbers, 'pnorm')

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  normal_data$random_numbers
D = 0.025285, p-value = 0.5448
alternative hypothesis: two-sided
Code
ks.test(exp_data$random_numbers, 'pnorm')

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  exp_data$random_numbers
D = 0.50003, p-value < 0.00000000000000022
alternative hypothesis: two-sided

4.2 Normal Data: Analysis of Variance (ANOVA)

If you have multiple groups and want to test whether their means are significantly different, you can use the function aov followed by summary to obtain an analysis of variance table with p-values; here we use anova_test from rstatix, which returns the same information in a tidy table.

Filter the data to only include Adélie, Chinstrap, and Gentoo penguins.

Code
adelie_chinstrap_gentoo <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap", "Gentoo"))

Perform an ANOVA test on the body mass of Adélie, Chinstrap, and Gentoo penguins.

Code
adelie_chinstrap_gentoo %>% 
  anova_test(body_mass_g ~ species) %>%
  kable()
Warning: NA detected in rows: 4,272.
Removing this rows before the analysis.
Effect DFn DFd F p p<.05 ges
species 2 339 343.626 0 * 0.67

The R package rstatix provides additional functions for ANOVA and repeated-measures ANOVA.
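As a sketch of the repeated-measures interface, anova_test accepts dv (dependent variable), wid (subject identifier), and within (within-subject factor) arguments; the data frame, column names, and simulated scores below are made up for illustration:

```r
library(dplyr)
library(rstatix)

# Hypothetical long-format data: 10 subjects measured at 3 time points
set.seed(42)
df <- data.frame(
  id    = factor(rep(1:10, times = 3)),
  time  = factor(rep(c("t1", "t2", "t3"), each = 10)),
  score = c(rnorm(10, mean = 4), rnorm(10, mean = 5), rnorm(10, mean = 7))
)

# Repeated-measures ANOVA: each subject (wid) contributes one score per level
# of the within-subject factor time
df %>%
  anova_test(dv = score, wid = id, within = time)
```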

4.3 Non-Normal Data: Mann-Whitney U Test (Mann-Whitney-Wilcoxon test, 2 Groups only)

The Mann-Whitney U test is used to compare two independent non-Normally distributed groups. In R, you can use the wilcox.test function to perform this test. By default, the option paired = FALSE is set, which is appropriate for independent data. See the documentation for more details.

Filter the data to only include Adélie and Chinstrap penguins.

Code
adelie_chinstrap <-  penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"))

Perform a Mann-Whitney U test on the body mass of Adélie and Chinstrap penguins.

Code
adelie_chinstrap %>% 
  wilcox_test(body_mass_g ~ species) %>%
  kable()
.y. group1 group2 n1 n2 statistic p
body_mass_g Adelie Chinstrap 151 68 4831 0.485

4.4 Non-Normal Data: Kruskal-Wallis Test

The Kruskal-Wallis test is used to compare the distribution of more than two independent non-Normal groups. In R, you can use the kruskal_test to perform this test.

Code
penguins %>% 
  kruskal_test(body_mass_g ~ island) %>%
  kable()
.y. n statistic df p method
body_mass_g 344 130.0696 2 0 Kruskal-Wallis

4.5 Multiple Comparison Corrections

When conducting multiple comparisons, it is important to account for the increased likelihood of obtaining false positives. One of the most commonly used corrections is the Bonferroni correction.

Normal data: To perform pairwise group comparisons with Bonferroni correction, use pairwise_t_test:

Code
treatment_outcome <- data.frame(
  treatment = rep(c("A", "B", "C"), each = 20),
  outcome = c(rnorm(20, mean = 2, sd =2),
              rnorm(20, mean = 6, sd = 4),
              rnorm(20, mean = 4, sd = 0.8)),
  patient_id = rep(1:20, times = 3))

treatment_outcome %>%
  pairwise_t_test(outcome ~ treatment, p.adjust.method = "bonferroni")
# A tibble: 3 × 9
  .y.     group1 group2    n1    n2           p p.signif      p.adj p.adj.signif
* <chr>   <chr>  <chr>  <int> <int>       <dbl> <chr>         <dbl> <chr>       
1 outcome A      B         20    20 0.000000724 ****     0.00000217 ****        
2 outcome A      C         20    20 0.105       ns       0.314      ns          
3 outcome B      C         20    20 0.00024     ***      0.000721   ***         

Non-Normal Data: To correct the follow-up of a Kruskal-Wallis test for multiple comparisons, use the Dunn test with the p.adjust.method option. The Dunn test is the post hoc pairwise comparison procedure appropriate after a Kruskal-Wallis test, which is a non-parametric analog of the one-way ANOVA.

Code
treatment_outcome_pois <- data.frame(
  treatment = rep(c("D", "E", "F"), each = 20),
  outcome = c(rpois(20, lambda = 0.1),
              rpois(20, lambda = 1),
              rpois(20, lambda = 6)))

treatment_outcome_pois %>% 
  kruskal_test(outcome ~ treatment) %>%
  kable()
Code
treatment_outcome_pois %>% 
  dunn_test(outcome ~ treatment, p.adjust.method = "bonferroni") %>%
  kable()

5 Dependent Data

5.1 Intraclass Correlation Coefficient ICC

One easy way to calculate the ICC is to first calculate an Anova table that renders the Sum Of Squares (SS) for the explanatory (or grouping or pairing) variable considered. The SS are the partition of the variability between groups/panels and within groups/panels. Then we can use the simple formula below to obtain the ICC.

\(ICC = \frac{SS_{between}}{SS_{between} + SS_{within}}\)

Load the data set selfesteem from the R package datarium. The data set contains 10 individuals’ self-esteem scores at three time points during a specific diet, to determine whether their self-esteem improved.

Code
data("selfesteem", package = "datarium")
head(selfesteem, 5)
# A tibble: 5 × 4
     id    t1    t2    t3
  <int> <dbl> <dbl> <dbl>
1     1  4.01  5.18  7.11
2     2  2.56  6.91  6.31
3     3  3.24  4.44  9.78
4     4  3.42  4.71  8.35
5     5  2.87  3.91  6.46
Code
selfesteem <- selfesteem %>%
  gather(key = "time", value = "score", t1, t2, t3) %>%
  convert_as_factor(id, time)

head(selfesteem, 5) %>%
  kable()
id time score
1 t1 4.005027
2 t1 2.558124
3 t1 3.244241
4 t1 3.419538
5 t1 2.871243

For the ICC calculation, we are not interested in the differences over time. Note that we calculate the anova table just to obtain the SS’s for the groups. In this case, the grouping variable is the id of every person.

Code
anova_table_se <- aov(score ~ id, data = selfesteem)
summary(anova_table_se)
            Df Sum Sq Mean Sq F value Pr(>F)
id           9   4.57   0.508   0.085      1
Residuals   20 119.08   5.954               
Code
#ICC <- ss-between / (ss-between + ss-within)
ICC <- 4.57/(4.57+119.08)
cat("ICC is:",ICC)
ICC is: 0.03695916

In this case, the ICC is very small, so most of the variance is not at the person level but over time. There is a clear increasing trend over time.

Alternatively you can use anova_test.

Code
selfesteem %>%
  anova_test(score ~ id) %>%
  kable()
Effect DFn DFd F p p<.05 ges
id 9 20 0.085 1 0.037

5.2 Plotting Paired Data

The R package PairedData provides functions to plot line-boxplots (profile plots), Bland-Altman plots, and sliding chart plots.

Bland-Altman plot to measure agreement between two methods:

Code
data("anscombe2")
with(anscombe2,plot(paired(X2,Y2),type="BA"))

5.3 Non-Normal Paired 2 Groups Comparison: Wilcoxon Signed-Rank Test

Paired data violate the assumption of independence between groups, so for non-Normal paired data we perform a non-parametric test, the Wilcoxon signed-rank test, with the R function wilcox.test and the option paired = TRUE. This test only works for two groups.

Example data set: weight of the same 10 mice before and after a certain treatment.

First, we need to put the data into two numeric vectors.

Code
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
                group = rep(c("before", "after"), each = 10),
                weight = c(before,  after))
Code
res <- wilcox.test(before, after, paired = TRUE)
res

    Wilcoxon signed rank exact test

data:  before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0

Print only the p-value.

Code
res$p.value
[1] 0.001953125

5.4 Multiple Comparisons with Paired Data

R package rstatix provides a series of functions to work with paired data or repeated measures, several groups and correct for multiple comparisons.

Code
df <- ToothGrowth
df$dose <- as.factor(df$dose)

Use the function t_test to obtain p-values corrected for multiple comparisons; here we set paired = TRUE.

Code
pairwise.test <- df %>% t_test(len ~ dose, paired = TRUE)
pairwise.test %>%
  kable()
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
len 0.5 1 20 20 -6.966881 19 0.0000012 0.0000025 ****
len 0.5 2 20 20 -11.291494 19 0.0000000 0.0000000 ****
len 1 2 20 20 -4.604647 19 0.0001930 0.0001930 ***
Code
df %>%
  ggboxplot(x = "dose", y = "len", fill = "steelblue") +
  stat_pvalue_manual(
    pairwise.test, label = "p.adj.signif", 
    y.position = c(29, 35, 39)) + theme_bw()